fix: load WASM grammars sequentially to avoid Node 20+ race condition by ravescovi · Pull Request #40 · colbymchenry/codegraph

ravescovi · 2026-02-17T16:33:54Z

Summary

- Replace `Promise.allSettled(entries.map(...))` with a sequential `for...of` loop in `initGrammars()` to avoid a known web-tree-sitter WASM race condition on Node.js 19+/20+.
- When multiple grammars with external scanners (TypeScript, TSX, C#, Swift, Kotlin, Dart, etc.) are loaded concurrently, V8's WebAssembly runtime hits a symbol resolution race where one grammar's exports overwrite another's GOT entries.
- This produces errors like `bad export type for 'tree_sitter_tsx_external_scanner_create': undefined`, causing those languages to silently fail to index.

Root Cause

web-tree-sitter WASM instantiation is not safe for concurrent Language.load() calls on Node.js 19+ (V8 10.8+). The external scanner symbols from one grammar can collide with another's during parallel initialization.

Documented upstream:

Test plan

- Verified on Node.js v20.20.0 (Linux x86_64): all 16 grammars load successfully after the fix.
- Before fix: only Python indexed (303 files). After fix: Python + TSX + TypeScript + JavaScript + JSX (408 files).
- No grammar "Failed to load" warnings in output after the change.

web-tree-sitter has a known race condition when loading multiple WASM grammars concurrently on Node.js 19+ (V8 10.8+). External scanner symbols from one grammar can overwrite another's GOT entries, causing "bad export type" errors for TypeScript, TSX, and other languages. Replace Promise.allSettled(entries.map(...)) with a sequential for...of loop so each grammar fully initializes before the next one starts. Ref: tree-sitter/tree-sitter#2338
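
A minimal sketch of the sequential pattern described above, not the actual `initGrammars()` code. The `loadGrammarsSequentially` helper, the `GrammarLoader` type, and the `LoadResult` shape are hypothetical names introduced for illustration; in the real code the loader would presumably wrap web-tree-sitter's `Language.load()`. Each failure is recorded and the loop continues, so one bad grammar does not abort the rest, which is roughly what `Promise.allSettled` provided before.

```typescript
// Hypothetical loader signature: resolves once a grammar is fully initialized.
type GrammarLoader<T> = (name: string, wasmPath: string) => Promise<T>;

// Hypothetical result record: either a loaded language or the error that occurred.
interface LoadResult<T> {
  name: string;
  language?: T;
  error?: unknown;
}

// Load grammars strictly one at a time. Because each load is awaited inside
// the loop, the next load does not start until the previous one has settled.
async function loadGrammarsSequentially<T>(
  entries: ReadonlyArray<readonly [string, string]>,
  load: GrammarLoader<T>,
): Promise<Array<LoadResult<T>>> {
  const results: Array<LoadResult<T>> = [];
  for (const [name, wasmPath] of entries) {
    try {
      results.push({ name, language: await load(name, wasmPath) });
    } catch (error) {
      // Record the failure and keep going with the remaining grammars.
      results.push({ name, error });
    }
  }
  return results;
}
```

The trade-off is startup latency: total load time becomes the sum of the individual loads rather than the maximum, which the PR accepts in exchange for correct indexing.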

colbymchenry

Amazing work! Thank you!

Final wave of the codegraph tool-audit friction sweep.

Polish:
- colbymchenry#25 compare_to_ref now reports body-only-edited files explicitly.
- colbymchenry#26 compare_to_ref includeEdges renders symbol names (not raw IDs), filters self-edges, drops empty file headers.
- colbymchenry#27 codegraph_coverage gains sources (list) + drop modes; the two audit-residue coverage sources were removed from the index.
- colbymchenry#28 role classifier ROLE_LIST_TEXT requires structural route/handler evidence; stops api_endpoint over-assignment from docstrings.
- colbymchenry#29 status topBiomarkers emits an explicit clean/0-findings line.
- colbymchenry#30 discover skips test-fixture indices (FIXTURE_DIR_NAMES).
- colbymchenry#31 CLI reload-modules warns it has no lasting effect (ephemeral).
- colbymchenry#32 session/note CLI --limit defaults aligned to MCP (20 / 50).
- colbymchenry#33 CLI ask renders the verified-citations block (shared buildCitationReport helper, reused from the MCP path).
- colbymchenry#34 codegraph_session gains a delete action + session delete CLI subcommand + deleteSession query helper.
- colbymchenry#40 fuzzy-fallback banner extended to coverage + role symbol modes.

Docs:
- colbymchenry#35 find intent-mode hint references codegraph_graph, not the removed codegraph_callees / codegraph_walk.
- colbymchenry#36 dead_code via=rule footer recommends via=llm.
- colbymchenry#37 sql read-only rejection message names both the MCP and CLI schema-flag forms (surface-neutral).
- colbymchenry#38 serve --no-write-tools help names real write-class tools.
- colbymchenry#39 CLI local-chat help says "local LLM" to match MCP.

Reviewer APPROVE; info findings (CLAUDE.md CLI docs, JSDoc wording) addressed. Suite 3037 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rift/diff foundation) (#38)

* feat(PF-690): schema v6 + per-symbol fingerprint columns for duplicate/drift/diff infrastructure

First slice of the trace/duplicate/drift roadmap that Codex + agy debated in the design RFC. Pure data infrastructure; no new CLI/MCP surface yet. PR #39 (codegraph_diff), PR #40 (codegraph_duplicates), and PR #41 (codegraph_explain) will consume these columns.

## What changed

- `src/extraction/fingerprints.ts` (new, ~245 lines): SHA-256 hashes computed from the in-memory tree-sitter subtree.
  - `astHash` (Type-1): normalized token stream with identifiers + literals preserved exactly. Detects "same code, only whitespace/comments differ".
  - `astShapeHash` (Type-2): identifier leaves in non-semantic positions replaced by `_ID`. Detects "same code, renamed locals". Property/field/type identifiers preserved by type; member-access targets, callees, kwarg names, type names, import names preserved by parent-context check.
  - `sigHash`: SHA-256 of the signature string. Null when no signature was extracted.
  - Comment + whitespace stripped; trivia tokens (commas, semicolons, braces) excluded via `namedChild` walk.
- Schema v6 migration (`src/db/migrations.ts:93-119`, `src/db/schema.sql`): adds 4 nullable columns to `nodes` table: `ast_hash`, `ast_shape_hash`, `sig_hash`, `call_pattern_hash`. Partial indexes on the two body hashes (`WHERE NOT NULL`) so duplicate-detection sweeps are O(log N) lookups instead of full scans. `callPatternHash` is reserved for post-resolution population by a later PR.
- Extraction wiring (`src/extraction/tree-sitter.ts:425-460`): computes the three body hashes from the already-parsed tree-sitter subtree inside `createNode`. agy's RFC point (that the AST is already in memory and hashing is microseconds) verified on a 107-file codegraph src/ corpus: 2.4s with vs 2.9s without (overhead below run-to-run variance, well under Codex's ≤15% budget).
- `Node` interface (`src/types.ts:165-198`): adds nullable `astHash`, `astShapeHash`, `sigHash`, `callPatternHash` fields with provenance docstrings.
- `queries.ts` insertNode / updateNode / rowToNode: round-trip the fingerprint columns nullably so framework-extractor synthesized route nodes (no body) keep `null` fingerprints; downstream consumers filter with `WHERE ast_hash IS NOT NULL`.

## v1 contract (Council RFC, locked by tests)

- Detects: Type-1 (whitespace/comment-insensitive clones), Type-2 (renamed-locals clones).
- Does NOT detect: Type-3 (statement reorder), Type-4 (semantic equivalence).
- Literal values preserved → security/config code where the literal matters does not falsely conflate. Strongest counterpoint the council named ("miss literal-only differs") explicitly accepted.

## Bug-pin verified during review

Codex pass 1 caught a real BLOCKER: `tree-sitter-python` parses `obj.start()` as `attribute(identifier "obj", identifier "start")` (both children are plain `identifier`), so a type-only rename rule would have conflated `obj.start()` with `obj.stop()`. Fix: parent-context check (`shouldPreserveIdentifier`) preserves identifiers in semantic positions: `attribute` children, `call.function` field, `keyword_argument` children, types, imports.

Codex round 2 caught a follow-on: Python kwargs (`g(start=1)`); added `keyword_argument` to the semantic-parent set.

## Tests (`__tests__/fingerprints.test.ts`, 14 cases)

- sigHash determinism + null on missing signature.
- Determinism: same input → same hex.
- Type-1: whitespace/comment edits preserve astHash.
- Type-2: renamed locals share astShapeHash, NOT astHash.
- Member rename diverges (TS `property_identifier` path).
- Literal change diverges (security sensitivity pinned).
- Control-flow reorder diverges (Type-3 NOT detected, pinned).
- Python regression: `obj.start()` vs `obj.stop()` diverge (member preserved despite both being `identifier`).
- Python bare callee: `start()` vs `stop()` diverge.
- Python kwarg: `g(start=1)` vs `g(stop=1)` diverge.
- Python param rename: same astShapeHash, different astHash.
- Cross-language: TS body ≠ Python body even when semantically equivalent.

## Reviewer trail

- Codex pass 1: 1 BLOCKER (Python member conflation) + 1 REVIEW (missing Python tests) + 1 NITPICK (stale comment).
- Codex round 2: BLOCKER + REVIEW CLOSED. New REVIEW (Python kwarg conflation) + NITPICK (header repeated stale claim).
- Codex round 3: Both round-2 findings CLOSED. Last NITPICK (kwarg comment misrepresented the over-preservation trade-off). Codex authorized "iterate for the comment fix, then ship".
- Doc comment now accurately describes the trade-off: kwarg set-membership preserves any direct identifier leaf including value-side identifiers; tighter field-specific check deferred to a follow-up.

## Verification

- tsc --noEmit clean
- npm test: 1026 passed | 2 skipped (was 1012 on main; +14 fingerprint tests)
- npm run test:eval:structural: 8/8 PASS, recall=1.00 precision=1.00 fp=0 (no regression vs main baseline)
- Index-time delta on 107-file corpus: 2.4s with vs 2.9s without; below run-to-run variance, well under ≤15% target.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(PF-690): cross-language member preservation + idempotent v6 migration (Codex round 4)

Codex round 4 deep sweep verified via real `tree-sitter-wasms` parses that the v1 fingerprint rule conflated semantically different code in Ruby, Java, C#, and Rust. The original rule only handled Python's `attribute` shape + the `call.function` field. Each of these four languages emits plain `identifier` for member/callee positions but under DIFFERENT parent node types:

- Ruby: `user.start` -> call(identifier "user", identifier "start"), method field carries the member name (not `function`).
- Java: `obj.start()` -> method_invocation(identifier, identifier).
- C#: `obj.Start()` -> invocation_expression > member_access_expression(identifier, identifier).
- Rust: `Router::new()` -> call_expression > scoped_identifier(identifier "Router", identifier "new").

Fix: extend `SEMANTIC_PARENT_TYPES` with `method_invocation`, `member_access_expression`, `invocation_expression`, `scoped_identifier`, `scoped_call_expression`, `field_expression`. Add `call.method` field check to `shouldPreserveIdentifier` to cover Ruby's dual-purpose `call` type. Same set-membership v1 trade-off applies (accepts false negative on receiver names rather than risk semantic-name conflation).

Plus Codex round 4 REVIEW: migration v6 was not idempotent under concurrent-open race. Two processes hitting a v5 database could both read version 5, both enter migration, and the second's `ALTER TABLE ADD COLUMN` would fail with duplicate-column even though the resulting schema is fine. Fixed via `PRAGMA table_info` pre-check per column so already-applied additions become no-ops. `CREATE INDEX IF NOT EXISTS` was already idempotent.

Tests added (4 cross-language regressions):

- Ruby `user.start` vs `user.stop` -> different astShapeHash
- Java `obj.start()` vs `obj.stop()` -> different astShapeHash
- C# `obj.Start()` vs `obj.Stop()` -> different astShapeHash
- Rust `Router::new()` vs `Router::default()` -> different astShapeHash

Each pins the specific cross-language failure mode Codex verified.

Reviewer trail:

- Codex round 4 (deep sweep, 6 attack vectors): found 1 BLOCKER (cross-language conflation) + 1 REVIEW (migration race). Both fixed; remaining 4 vectors confirmed clean (hash determinism, persistence completeness, ERROR/MISSING handling, createNode hook coverage).
- CodeRabbit CLI: ran against the same diff, no findings.
- Claude Explore subagent: returned 7 findings; 3 already covered here (cross-language tests, migration safety), 4 deferred as documentation/contract clarifications (line-ending CRLF normalization, downstream kwarg trade-off doc, callPatternHash contract clarity, SQLite version compat; node:sqlite ships SQLite 3.42+ which fully supports partial indexes).

Verification:

- tsc --noEmit clean
- npm test: 1030 passed | 2 skipped (was 1026 last commit; +4 cross-language tests)
- npm run test:eval:structural: 8/8 PASS, recall=1.00 precision=1.00 fp=0 (no regression vs baseline)
- All 18 fingerprint tests pass deterministically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
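
The Type-1 / Type-2 distinction in the commit message above can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not the code in `src/extraction/fingerprints.ts`: the flat `Token` list, the `semantic` flag, and the `sha256Hex` helper are hypothetical stand-ins for the tree-sitter subtree walk and the `shouldPreserveIdentifier` parent-context check. Only the function names `astHash` and `astShapeHash` and the `_ID` placeholder come from the commit message.

```typescript
import { createHash } from "node:crypto";

type TokenKind = "identifier" | "literal" | "keyword" | "operator";

// Hypothetical flattened token; the real implementation walks tree-sitter nodes.
interface Token {
  kind: TokenKind;
  text: string;
  // Assumed flag: true when the identifier sits in a semantic position
  // (member name, callee, kwarg name, type, import) and must be preserved.
  semantic?: boolean;
}

function sha256Hex(input: string): string {
  return createHash("sha256").update(input).digest("hex");
}

// Type-1 style fingerprint: identifiers and literals are kept verbatim, so
// only whitespace and comments (absent from the token list) are ignored.
function astHash(tokens: readonly Token[]): string {
  return sha256Hex(JSON.stringify(tokens.map((t) => [t.kind, t.text])));
}

// Type-2 style fingerprint: identifiers outside semantic positions collapse
// to the `_ID` placeholder, so renamed locals hash identically.
function astShapeHash(tokens: readonly Token[]): string {
  const normalized = tokens.map((t) =>
    t.kind === "identifier" && !t.semantic ? [t.kind, "_ID"] : [t.kind, t.text],
  );
  return sha256Hex(JSON.stringify(normalized));
}
```

Under this model, `total = obj.start(1)` and `sum = obj.start(1)` share a shape hash but not an exact hash, while `total = obj.stop(1)` differs in both as long as the member name is flagged `semantic`. Literals are never normalized, matching the stated v1 contract.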

colbymchenry approved these changes Feb 19, 2026

colbymchenry merged commit 5d699ab into colbymchenry:main Feb 19, 2026

colbymchenry mentioned this pull request Feb 19, 2026

tree-sitter grammar loading is broken for most languages. #44

Closed

mbenhamd mentioned this pull request May 24, 2026

feat(PF-690): schema v6 + per-symbol fingerprint columns (duplicate/drift/diff foundation) mbenhamd/codegraph#38

Merged

5 tasks

mbenhamd mentioned this pull request May 24, 2026

feat(PF-691): codegraph diff <oldDb> <newDb> DB-vs-DB structural primitive mbenhamd/codegraph#39

Merged

6 tasks

colbymchenry merged 1 commit into colbymchenry:main from ravescovi:fix/sequential-grammar-loading
